Principal Component Analysis (PCA) + Kernel PCA#
PCA is one of the most useful “first moves” in data science:
compress data while keeping what varies the most
denoise (sometimes) by dropping low-variance directions
visualize high-dimensional data in 2D/3D
understand how ideas like eigenvectors show up in real ML
This notebook mirrors the style of the supervised-learning notebooks: intuition first, then the math, then a from-scratch implementation + visuals.
Learning goals#
By the end you should be able to:
explain PCA in plain language (“rotate the camera to the widest view”)
derive PCA from variance maximization and the covariance matrix
implement PCA from scratch using NumPy (center → covariance → eigendecompose → project)
understand Kernel PCA and the kernel trick at a high level
interpret explained variance and know when PCA can mislead you
Notation (quick)#
Data matrix: \(X \in \mathbb{R}^{n\times d}\) (rows = samples, columns = features)
Centered data: \(X_c = X - \mathbf{1}\mu^T\) where \(\mu\) is the feature-wise mean
Covariance matrix: \(\Sigma = \frac{1}{n-1} X_c^T X_c \in \mathbb{R}^{d\times d}\)
Principal axes (directions): columns of \(V\) (or rows of components_)
Scores (coordinates in PC space): \(Z = X_c V_k\)
Table of contents#
Intuition (very simple)
Statistical & linear algebra explanation
Step-by-step algorithm (from scratch)
Kernel PCA extension (why + how)
Visual explanations (Plotly)
When PCA works / fails
Comparison table (PCA vs Kernel PCA vs ICA)
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from dataclasses import dataclass
from plotly.subplots import make_subplots
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import StandardScaler
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Intuition (very simple)#
Explain PCA like you’re 10:
Imagine you have a handful of dots on the floor and you want to take a picture that shows the dots as “spread out” as possible.
Metaphor 1: “rotate the camera to see the widest view”#
If you point your camera the wrong way, the dots might look squished.
If you rotate the camera to the best angle, the dots look most spread out.
PCA is the math version of:
“Rotate the axes until the data looks widest in one direction.”
That “widest direction” is the first principal component.
Metaphor 2: “shadow on the wall”#
Hold a 3D object in front of a lamp. The object casts a shadow on the wall.
Different angles give different shadows.
PCA chooses the angle where the shadow is most spread out.
Then PCA can also choose a second direction (at a right angle) that’s the next most spread out shadow.
What PCA is not#
It doesn’t find “the best features” in a semantic sense.
It finds directions of maximum variance (biggest spread), which is a statistical idea.
2) Statistical & linear algebra explanation#
2.1 Covariance matrix (what PCA is built from)#
After centering, the covariance matrix is \(\Sigma = \frac{1}{n-1} X_c^T X_c\), where:
Diagonal entries: variances of each feature
Off-diagonal entries: how two features vary together
PCA looks for directions (unit vectors) \(w\) where the projected variance is large.
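A quick numeric check of this definition (a small sketch on a throwaway array with its own RNG, so it doesn't disturb the notebook's seeded rng): the hand-built covariance matches np.cov, and its diagonal is exactly the per-feature variances.
# Sanity check: hand-built covariance vs np.cov (throwaway demo array, separate RNG)
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.normal(size=(200, 3))
Xc_demo = X_demo - X_demo.mean(axis=0)                 # center feature-wise
cov_manual = Xc_demo.T @ Xc_demo / (len(X_demo) - 1)
print(np.allclose(cov_manual, np.cov(X_demo, rowvar=False)))          # True
print(np.allclose(np.diag(cov_manual), X_demo.var(axis=0, ddof=1)))   # True: diagonal = variances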
2.2 Variance maximization → eigenvectors#
Project data onto a unit direction \(w \in \mathbb{R}^d\): \(z = X_c w\).
The variance of \(z\) is \(\operatorname{Var}(z) = w^T \Sigma w\).
PCA's first component solves \(\max_{\|w\| = 1} w^T \Sigma w\).
Using a Lagrange multiplier for the constraint \(\|w\|^2 = 1\), the stationarity condition is \(\Sigma w = \lambda w\).
So:
eigenvectors of \(\Sigma\) are candidate directions
eigenvalues tell you how much variance is along that direction
The largest eigenvalue’s eigenvector is the first principal component.
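Here is that claim as a small sketch (synthetic data with its own RNG): the variance of a projection equals \(w^T \Sigma w\), and none of a few thousand random unit directions beats the top eigenvector.
# Check: projected variance = w^T Sigma w, and it is maximized by the top eigenvector
chk_rng = np.random.default_rng(0)                  # separate RNG, keeps the notebook's rng stream untouched
X_chk = chk_rng.normal(size=(500, 3)) @ np.array([[2.0, 0.3, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 0.2]])
Xc_chk = X_chk - X_chk.mean(axis=0)
Sigma_chk = Xc_chk.T @ Xc_chk / (len(X_chk) - 1)
evals_chk, evecs_chk = np.linalg.eigh(Sigma_chk)    # ascending eigenvalues
w_top = evecs_chk[:, -1]                            # eigenvector of the largest eigenvalue
print(np.isclose((Xc_chk @ w_top).var(ddof=1), w_top @ Sigma_chk @ w_top))   # True: Var(X_c w) = w^T Sigma w
W_rand = chk_rng.normal(size=(2000, 3))
W_rand /= np.linalg.norm(W_rand, axis=1, keepdims=True)                      # random unit directions
rand_vars = np.einsum("ij,jk,ik->i", W_rand, Sigma_chk, W_rand)
print(rand_vars.max() <= w_top @ Sigma_chk @ w_top + 1e-12)                  # True: no random direction does better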
2.3 Orthogonality (why PCs are at right angles)#
The covariance matrix \(\Sigma\) is symmetric, so it has an orthonormal eigenbasis:
eigenvectors are orthogonal (perpendicular)
we can choose them to have unit length
This gives you a clean story:
PC1 captures the most variance
PC2 captures the most remaining variance subject to being orthogonal to PC1
and so on
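One more tiny check, reusing evecs_chk from the sketch above: the eigenvectors of the symmetric covariance matrix really are orthonormal.
# Symmetric Sigma => orthonormal eigenbasis: V^T V = I
print(np.allclose(evecs_chk.T @ evecs_chk, np.eye(3)))   # True: the PCs are perpendicular unit vectors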
2.4 Relationship to SVD (another way to compute PCA)#
Take the SVD of the centered data: \(X_c = U S V^T\).
Then \(\Sigma = \frac{1}{n-1} X_c^T X_c = V \, \frac{S^2}{n-1} \, V^T\).
So:
the columns of \(V\) are the PCA directions (principal axes)
the eigenvalues of \(\Sigma\) are \(\lambda_i = \frac{s_i^2}{n-1}\), where \(s_i\) are the singular values (diagonal of \(S\))
In practice, SVD is often the numerically stable way to compute PCA.
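A minimal sketch of that equivalence (its own small random dataset; this also previews exercise 1): the right singular vectors are the principal axes, and \(s_i^2 / (n-1)\) are the eigenvalues of \(\Sigma\).
# PCA via SVD vs covariance eigendecomposition (should agree up to per-component sign flips)
svd_rng = np.random.default_rng(1)                 # separate RNG, keeps the notebook's rng stream untouched
X_svd = svd_rng.normal(size=(300, 4)) @ svd_rng.normal(size=(4, 4))
Xc_svd = X_svd - X_svd.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc_svd, full_matrices=False)        # rows of Vt = principal axes, descending
var_from_svd = s**2 / (len(X_svd) - 1)
evals_cov, evecs_cov = np.linalg.eigh(Xc_svd.T @ Xc_svd / (len(X_svd) - 1))   # ascending
print(np.allclose(var_from_svd, evals_cov[::-1]))                 # True: same variances, descending order
print(np.allclose(np.abs(Vt), np.abs(evecs_cov[:, ::-1].T)))      # True: same axes, up to sign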
3) Step-by-step algorithm (from scratch)#
Given \(X \in \mathbb{R}^{n\times d}\):
Center the data: \(X_c = X - \mu\)
Compute covariance: \(\Sigma = \frac{1}{n-1} X_c^T X_c\)
Eigen-decompose: \(\Sigma v_i = \lambda_i v_i\)
Sort eigenvalues descending, keep the top \(k\)
Project: \(Z = X_c V_k\)
Where \(V_k\) is the matrix of the top-\(k\) eigenvectors.
Below we’ll implement PCA in NumPy and sanity-check it against sklearn.
# A 2D dataset with strong correlation (nice for PCA visuals)
n = 350
Z = rng.normal(size=(n, 2))
A = np.array(
[
[3.0, 1.2],
[0.0, 0.7],
]
)
X2 = Z @ A.T
X2 = X2 + np.array([2.0, -1.0])
fig = px.scatter(
x=X2[:, 0],
y=X2[:, 1],
title="Correlated 2D data (we'll find the best rotation)",
labels={"x": "x1", "y": "x2"},
)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()
@dataclass
class PCAFit:
mean_: np.ndarray
components_: np.ndarray # shape: (k, d)
explained_variance_: np.ndarray # shape: (k,)
explained_variance_ratio_: np.ndarray # shape: (k,)
def transform(self, X: np.ndarray) -> np.ndarray:
Xc = X - self.mean_
return Xc @ self.components_.T
def inverse_transform(self, Z: np.ndarray) -> np.ndarray:
return Z @ self.components_ + self.mean_
def pca_fit(X: np.ndarray, n_components: int | None = None) -> PCAFit:
X = np.asarray(X, dtype=float)
n_samples, n_features = X.shape
mean = X.mean(axis=0)
Xc = X - mean
cov = (Xc.T @ Xc) / (n_samples - 1)
# Symmetric matrix => use eigh (stable, returns ascending eigenvalues)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]
total_var = eigvals.sum()
if n_components is None:
k = n_features
else:
k = int(n_components)
components = eigvecs[:, :k].T
explained_variance = eigvals[:k]
explained_variance_ratio = explained_variance / total_var
return PCAFit(
mean_=mean,
components_=components,
explained_variance_=explained_variance,
explained_variance_ratio_=explained_variance_ratio,
)
pca2 = pca_fit(X2, n_components=2)
pca2.explained_variance_ratio_
array([0.9705, 0.0295])
sk_pca2 = PCA(n_components=2)
Z_sklearn = sk_pca2.fit_transform(X2)
print("from scratch explained variance ratio:", pca2.explained_variance_ratio_)
print("sklearn explained variance ratio: ", sk_pca2.explained_variance_ratio_)
# Compare directions up to sign (PCA axes can flip sign and still be 'the same')
print()
print("from scratch components_ (rows):")
print(pca2.components_)
print()
print("sklearn components_ (rows):")
print(sk_pca2.components_)
from scratch explained variance ratio: [0.9705 0.0295]
sklearn explained variance ratio: [0.9705 0.0295]
from scratch components_ (rows):
[[-0.9954 -0.0953]
[ 0.0953 -0.9954]]
sklearn components_ (rows):
[[ 0.9954 0.0953]
[-0.0953 0.9954]]
4) Kernel PCA extension (non-linear structure)#
4.1 Motivation: when linear PCA is the wrong shape#
Linear PCA can only:
rotate your coordinate system
then optionally drop dimensions
So if your data lives on a curved shape (like two moons), linear PCA can’t “unbend” it.
Kernel PCA keeps the PCA idea (maximize variance) but does it in a feature space where the data may become more linear.
4.2 Kernel trick intuition#
PCA needs dot products like \(\langle x, z \rangle\).
Kernel trick idea:
pretend we mapped points into a huge feature space: \(\phi(x)\)
but instead of computing \(\phi(x)\) explicitly, we compute \(k(x, z) = \langle \phi(x), \phi(z) \rangle\) directly (a tiny numeric check follows the kernel list below)
Common kernels:
RBF (Gaussian): \(k(x,z)=\exp(-\gamma\|x-z\|^2)\)
Polynomial: \(k(x,z)=(x^T z + c)^p\)
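A tiny concrete check of the trick (a sketch; the explicit map is written out only for 2D inputs with the polynomial kernel, \(p = 2\), \(c = 1\)): the kernel value equals a dot product in the expanded feature space, computed without forming \(\phi\).
# Kernel trick, concretely: (x^T z + 1)^2 equals phi(x) . phi(z) for an explicit 6-D feature map
def phi_poly2(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2, x1**2, x2**2, np.sqrt(2) * x1 * x2])

kt_rng = np.random.default_rng(2)          # separate RNG, keeps the notebook's rng stream untouched
x_a, x_b = kt_rng.normal(size=2), kt_rng.normal(size=2)
print(np.isclose((x_a @ x_b + 1.0) ** 2, phi_poly2(x_a) @ phi_poly2(x_b)))   # True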
4.3 Mathematical formulation (high level)#
Let \(\Phi\) be the centered feature-space data matrix (rows are \(\phi(x_i)\)). Kernel PCA avoids forming \(\Phi\) and works with the kernel matrix \(K\), where \(K_{ij} = k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle\).
We must center it (equivalent to centering in feature space): \(\tilde{K} = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n\), where \(\mathbf{1}_n\) is the \(n \times n\) matrix with every entry equal to \(1/n\).
Then we eigendecompose: \(\tilde{K} \alpha_i = \lambda_i \alpha_i\).
A point \(x\) is projected using kernel similarities to training points: \(z_i(x) = \sum_{j=1}^{n} \alpha_{ij}\, k(x_j, x)\).
(Details vary by normalization convention, but the core idea is: projection = weighted sum of kernel evaluations.)
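Here is a minimal from-scratch sketch of those steps for the RBF kernel, using the same scaling convention sklearn uses (for training points, each embedding coordinate is the unit-norm eigenvector entry scaled by \(\sqrt{\lambda_i}\)); column signs can still differ.
# Kernel PCA from scratch (sketch): kernel matrix -> center -> eigendecompose -> embed
def kernel_pca_rbf(X, n_components=2, gamma=10.0):
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)                                  # RBF kernel matrix
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n            # centering in feature space
    eigvals, eigvecs = np.linalg.eigh(K_c)
    idx = np.argsort(eigvals)[::-1][:n_components]                 # top components
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))

Xm_demo, _ = make_moons(n_samples=200, noise=0.1, random_state=0)
Xm_demo = StandardScaler().fit_transform(Xm_demo)
Z_scratch = kernel_pca_rbf(Xm_demo, n_components=2, gamma=10.0)
Z_ref = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(Xm_demo)
print(np.allclose(np.abs(Z_scratch), np.abs(Z_ref), atol=1e-6))    # should print True (up to per-column sign flips)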
4.4 Linear PCA vs Kernel PCA (conceptually)#
Linear PCA: pick a straight line / plane that captures the most spread.
Kernel PCA: first “bend” space using a kernel (implicitly), then do PCA there.
5) Visual explanations (Plotly)#
We’ll build four interactive visuals:
Original data vs principal axes
Explained variance bar chart
2D → 1D projection animation
Linear vs Kernel PCA comparison
# 5.1 Original data + principal axes (PC1, PC2)
mu = pca2.mean_
pc1, pc2 = pca2.components_[0], pca2.components_[1]
# Scale arrows by a few standard deviations along each PC
s1 = 3.0 * float(np.sqrt(pca2.explained_variance_[0]))
s2 = 3.0 * float(np.sqrt(pca2.explained_variance_[1]))
p1a, p1b = mu - s1 * pc1, mu + s1 * pc1
p2a, p2b = mu - s2 * pc2, mu + s2 * pc2
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=X2[:, 0],
y=X2[:, 1],
mode="markers",
marker=dict(size=7, opacity=0.75, line=dict(width=0.5, color="white")),
name="data",
)
)
fig.add_trace(go.Scatter(x=[p1a[0], p1b[0]], y=[p1a[1], p1b[1]], mode="lines", name="PC1"))
fig.add_trace(go.Scatter(x=[p2a[0], p2b[0]], y=[p2a[1], p2b[1]], mode="lines", name="PC2"))
fig.add_trace(
go.Scatter(
x=[mu[0]],
y=[mu[1]],
mode="markers",
marker=dict(size=10, symbol="x", color="black"),
name="mean",
)
)
fig.update_layout(
title="PCA on 2D data: principal axes are the best 'camera rotation'",
xaxis_title="x1",
yaxis_title="x2",
)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()
# 5.2 Explained variance (a slightly higher-dimensional example)
n = 600
latent = rng.normal(size=(n, 2))
latent[:, 0] *= 3.0 # big variance direction
latent[:, 1] *= 1.0
W = rng.normal(size=(2, 6))
X6 = latent @ W + 0.25 * rng.normal(size=(n, 6))
pca6 = pca_fit(X6, n_components=6)
ratios = pca6.explained_variance_ratio_
cum = np.cumsum(ratios)
pcs = [f"PC{i+1}" for i in range(len(ratios))]
fig = go.Figure()
fig.add_trace(go.Bar(x=pcs, y=ratios, name="explained variance"))
fig.add_trace(go.Scatter(x=pcs, y=cum, mode="lines+markers", name="cumulative"))
fig.update_layout(
title="Explained variance ratio",
yaxis_title="fraction of total variance",
)
fig.update_yaxes(tickformat=".0%")
fig.show()
# 5.3 2D → 1D projection animation (onto PC1)
Xc = X2 - pca2.mean_
z1 = Xc @ pca2.components_[0]
X_proj = pca2.mean_ + np.outer(z1, pca2.components_[0])
# A line representing PC1
pc1_dir = pca2.components_[0]
span = 4.0 * float(np.sqrt(pca2.explained_variance_[0]))
line_a, line_b = pca2.mean_ - span * pc1_dir, pca2.mean_ + span * pc1_dir
x_min = float(min(X2[:, 0].min(), X_proj[:, 0].min()) - 0.5)
x_max = float(max(X2[:, 0].max(), X_proj[:, 0].max()) + 0.5)
y_min = float(min(X2[:, 1].min(), X_proj[:, 1].min()) - 0.5)
y_max = float(max(X2[:, 1].max(), X_proj[:, 1].max()) + 0.5)
ts = np.linspace(0.0, 1.0, 21)
frames = []
for t in ts:
Xt = (1 - t) * X2 + t * X_proj
frames.append(
go.Frame(
data=[go.Scatter(x=Xt[:, 0], y=Xt[:, 1])],
name=f"t={t:.2f}",
traces=[0],
)
)
fig = go.Figure(
data=[
go.Scatter(
x=X2[:, 0],
y=X2[:, 1],
mode="markers",
marker=dict(size=7, opacity=0.75, line=dict(width=0.5, color="white")),
name="points",
),
go.Scatter(
x=[line_a[0], line_b[0]],
y=[line_a[1], line_b[1]],
mode="lines",
name="PC1 axis",
),
],
frames=frames,
)
fig.update_layout(
title="Projecting 2D points down to 1D (onto PC1)",
xaxis_title="x1",
yaxis_title="x2",
updatemenus=[
{
"type": "buttons",
"showactive": False,
"buttons": [
{
"label": "Play",
"method": "animate",
"args": [None, {"frame": {"duration": 80, "redraw": True}, "fromcurrent": True}],
},
{
"label": "Pause",
"method": "animate",
"args": [[None], {"frame": {"duration": 0, "redraw": False}, "mode": "immediate"}],
},
],
}
],
sliders=[
{
"steps": [
{
"method": "animate",
"args": [[f"t={t:.2f}"], {"mode": "immediate", "frame": {"duration": 0, "redraw": True}}],
"label": f"{t:.2f}",
}
for t in ts
],
"currentvalue": {"prefix": "t = "},
}
],
)
fig.update_xaxes(range=[x_min, x_max])
fig.update_yaxes(range=[y_min, y_max], scaleanchor="x", scaleratio=1)
fig.show()
# 5.4 Linear PCA vs Kernel PCA on a non-linear dataset (two moons)
X_moons, y_moons = make_moons(n_samples=500, noise=0.12, random_state=7)
Xs = StandardScaler().fit_transform(X_moons)
Z_lin = PCA(n_components=2).fit_transform(Xs)
Z_k = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(Xs)
colors = {0: "#1f77b4", 1: "#d62728"}
fig = make_subplots(
rows=1,
cols=3,
subplot_titles=[
"Original space (2D)",
"Linear PCA embedding",
"Kernel PCA embedding (RBF)",
],
)
for label in [0, 1]:
m = y_moons == label
fig.add_trace(
go.Scatter(
x=Xs[m, 0],
y=Xs[m, 1],
mode="markers",
marker=dict(size=6, color=colors[label], opacity=0.75),
name=f"class {label}",
showlegend=True,
),
row=1,
col=1,
)
fig.add_trace(
go.Scatter(
x=Z_lin[m, 0],
y=Z_lin[m, 1],
mode="markers",
marker=dict(size=6, color=colors[label], opacity=0.75),
name=f"class {label}",
showlegend=False,
),
row=1,
col=2,
)
fig.add_trace(
go.Scatter(
x=Z_k[m, 0],
y=Z_k[m, 1],
mode="markers",
marker=dict(size=6, color=colors[label], opacity=0.75),
name=f"class {label}",
showlegend=False,
),
row=1,
col=3,
)
fig.update_xaxes(title_text="x1 (scaled)", row=1, col=1)
fig.update_yaxes(title_text="x2 (scaled)", row=1, col=1)
fig.update_xaxes(title_text="PC1", row=1, col=2)
fig.update_yaxes(title_text="PC2", row=1, col=2)
fig.update_xaxes(title_text="KPCA1", row=1, col=3)
fig.update_yaxes(title_text="KPCA2", row=1, col=3)
fig.update_layout(
title="Linear vs Kernel PCA: kernel can 'unfold' non-linear structure",
legend_title_text="",
)
fig.show()
6) When PCA works / fails#
When PCA tends to work well#
You believe the “signal” is in large-variance directions.
Features are roughly on the same scale (or you standardize).
You want a quick baseline for compression/visualization.
Common failure modes#
Noise sensitivity#
PCA is variance-hungry: if noise creates variance, PCA may chase it.
In very noisy settings, PC1 can become “the noisiest direction”, not “the most meaningful direction”.
Scaling issues#
PCA is not scale-invariant: it depends on variances, which depend on each feature's units.
If one feature is measured in dollars and another in cents, the dollar feature can dominate.
Fix: standardize (StandardScaler) or use a domain-appropriate scaling.
Linear limitation#
PCA can only find linear directions.
If the data lies on a curve/manifold (moons, circles), linear PCA won’t separate the structure.
Fix: Kernel PCA (or other non-linear methods like UMAP/t-SNE for visualization).
Practical checklist#
Always center your data.
Decide whether you need scaling:
yes for mixed units / mixed scales
sometimes no if scale itself is meaningful
Pick n_components using explained variance and downstream performance (see the small sketch after this checklist).
Don’t over-interpret PCs as “real factors” without domain validation.
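As a small illustration of the n_components point (a sketch reusing pca6 from section 5.2; the 95% threshold is just a conventional starting point, not a rule):
# Heuristic sketch: smallest k whose cumulative explained variance crosses a threshold
threshold = 0.95                                            # assumption: conventional starting point, validate downstream
cum_ratio = np.cumsum(pca6.explained_variance_ratio_)       # pca6 was fit on X6 in section 5.2
k_keep = int(np.searchsorted(cum_ratio, threshold)) + 1     # first component count reaching the threshold
print(f"smallest k reaching {threshold:.0%} of variance: {k_keep}")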
# A quick scaling demo: one feature has a much larger numeric scale
n = 400
x1 = rng.normal(0, 1.0, size=n)
# Same underlying signal, but scaled by 50x
x2 = 50.0 * (0.8 * x1 + rng.normal(0, 0.4, size=n))
X_scale = np.c_[x1, x2]
p_unscaled = pca_fit(X_scale, n_components=2)
Xs_scaled = StandardScaler().fit_transform(X_scale)
p_scaled = pca_fit(Xs_scaled, n_components=2)
print("Explained variance ratio (unscaled):", p_unscaled.explained_variance_ratio_)
print("Explained variance ratio (scaled): ", p_scaled.explained_variance_ratio_)
Explained variance ratio (unscaled): [0.9999 0.0001]
Explained variance ratio (scaled): [0.9427 0.0573]
7) Comparison table: PCA vs Kernel PCA vs ICA#
| Method | What it tries to do | Linear? | What “components” mean | Typical uses | Key gotchas |
|---|---|---|---|---|---|
| PCA | Maximize variance with orthogonal directions | ✅ | Orthogonal axes capturing decreasing variance | Compression, denoising, visualization, preprocessing | Sensitive to scaling; can chase noise; linear only |
| Kernel PCA | PCA in an implicit feature space via a kernel | ❌ (can be non-linear) | Non-linear directions in feature space (via kernel eigenvectors) | Unfold curved structure; non-linear embeddings | Kernel/gamma choice matters |
| ICA | Separate statistically independent sources | ✅ (typically) | Directions that make components as independent as possible | Blind source separation (signals), artifact removal | More assumptions; components not ordered by variance; sign/scale ambiguities |
Exercises#
Implement PCA via SVD and verify it matches the covariance-eigendecomposition approach.
On the moons dataset, sweep gamma in Kernel PCA and observe when the embedding “unfolds” vs when it becomes noisy.
Create a dataset where the highest variance direction is pure noise — see PCA fail on purpose.
References#
ESL (Hastie, Tibshirani, Friedman): The Elements of Statistical Learning
Bishop: Pattern Recognition and Machine Learning
Schölkopf, Smola, Müller (1998): Nonlinear Component Analysis as a Kernel Eigenvalue Problem (Kernel PCA)
sklearn docs: PCA, KernelPCA, FastICA